Purpose: Use MLlib from PySpark to fit machine learning models and investigate the relationship between obesity and people's eating habits and physical condition.
Dataset:
The Estimation of Obesity Levels Based on Eating Habits and Physical Condition dataset includes data for estimating obesity levels in individuals from Mexico, Peru and Colombia, based on their eating habits and physical condition. The data contain 17 attributes and 2111 records. The records are labeled with the class variable NObesity (obesity level based on BMI, which is computed from height and weight), which allows classification of the data using the values Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III. 77% of the data was generated synthetically using the Weka tool and the SMOTE filter; 23% was collected directly from users through a web platform.
The predictors in this dataset are: frequent consumption of high-caloric food (FAVC), frequency of consumption of vegetables (FCVC), number of main meals (NCP), consumption of food between meals (CAEC), daily water consumption (CH2O), consumption of alcohol (CALC), calorie consumption monitoring (SCC), physical activity frequency (FAF), time using technology devices (TUE), transportation used (MTRANS), smoking (SMOKE), family history of overweight (family_history_with_overweight), Gender, Age, Height and Weight.
Data cleaning and modification: No missing data were detected in the dataset. To obtain a binary response for the later modeling, we coded Obesity Type I, Obesity Type II and Obesity Type III as obese (obsYes = 1) and the remaining levels of NObesity as not obese (obsYes = 0).
Supervised Learning Idea and Data Split:
Supervised learning means that one or more variables in the data set represent an output or response variable.
Generally speaking, supervised learning tries to relate predictors to a response variable through a model, including making inference on the model parameters, predicting a value, or classifying an observation. Supervised learning algorithms take a known set of input data (the training set) and known responses to that data (the output), and form models that generate reasonable predictions for the response to new input data.
To identify the best model for prediction, we split our data into a training and a test set. Tuning the hyperparameter(s) and estimating the model parameters are done iteratively on the training data only. The test set is used to produce an unbiased estimate of the performance of the final chosen model, so the test data must remain untouched and unseen during the training process. We must split our data into training and test sets before model fitting; otherwise we risk overfitting, since we would have already used the test set to build the final model.
Models: We fit the data set with three different classes of models: a logistic model, a classification tree and a random forest model. Here we briefly discuss the general idea of these models and how they work.
Logistic model: Logistic regression models are used mostly as a tool for data analysis and inference, where the main goal is to understand the role of the predictors in explaining the outcome. Logistic regression does not make many of the key assumptions of linear regression and general linear models that are based on ordinary least squares, particularly regarding linearity, normality, homoscedasticity, and measurement level. Our data meet the assumptions for logistic regression: first, the response is binary; second, the observations are different subjects and are therefore independent of each other; third, there is no strong multicollinearity among the predictors, as shown in the correlation matrix below.
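Concretely, for a binary response the logistic model relates the predictors to the probability of obesity through the log-odds:

```latex
\log\frac{p(\mathbf{x})}{1-p(\mathbf{x})}
  = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p,
\qquad
p(\mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}}
```

so each coefficient $\beta_j$ is interpretable as the change in log-odds of obesity per unit change in predictor $x_j$, holding the others fixed.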
Classification Tree model: The basic idea of tree models is to split the predictor space into regions, with each region representing a different prediction. A classification tree classifies or predicts group membership: for a given region, the most prevalent class is usually used as the prediction. One main advantage of trees is that they can be displayed graphically and are easily interpreted even by a non-expert (this is especially true for small trees), since they closely mirror human decision-making. Also, trees can easily handle categorical predictors without the need to create dummy variables. Since our response is binary, we applied a classification tree to fit our data.
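As a toy illustration (plain Python, not part of the MLlib pipeline), the "most prevalent class per region" rule can be sketched as:

```python
from collections import Counter

def region_prediction(labels):
    """Predict the most prevalent class among the training labels
    that fall into one terminal-node region of the tree."""
    return Counter(labels).most_common(1)[0][0]

# A region holding three obese (1) and one non-obese (0) subject predicts 1
print(region_prediction([1, 1, 0, 1]))  # -> 1
```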
Random Forest model: Random forest is based on the bagging algorithm and uses an ensemble learning technique. Random forests provide an improvement over bagging by decorrelating the trees: each split is forced to consider only a subset of the predictors. As in bagging, we build a number of decision trees on bootstrapped training samples; when building these trees, for each split a random subset of predictors is chosen as split candidates from the full set of p predictors. A random forest will generally provide better prediction performance than a single classification tree. However, a disadvantage of random forests is that the resulting model is often difficult or impossible to interpret, as we are averaging many trees rather than looking at a single tree.
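A minimal sketch (plain Python, with the predictor names from this dataset; a common default is to consider roughly sqrt(p) candidates per split) of the per-split feature subsampling that decorrelates the trees:

```python
import math
import random

def split_candidates(features, rng):
    """Return a random subset of ~sqrt(p) predictors to consider at one split."""
    m = max(1, int(math.sqrt(len(features))))
    return rng.sample(features, m)

rng = random.Random(1)
features = ["Age", "Height", "Weight", "FCVC", "NCP", "CH2O", "FAF", "TUE", "Gender"]
# p = 9 predictors, so each split only sees m = 3 of them
print(split_candidates(features, rng))
```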
Modules:
1. pandas
2. pyspark
3. matplotlib.pyplot
4. pyspark.sql
5. os
6. sys
7. pyspark.ml
### Import modules
import pandas as pd
import matplotlib.pyplot as plt
import pyspark.pandas as ps
import os
import sys
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer, VectorAssembler, StringIndexer, VectorIndexer, IndexToString, Interaction, StandardScaler, PCA
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier, RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
### Read in the data
obdt = pd.read_csv("ObesityDataSet_cleaned.csv")
obdt.head()
| | Gender | Age | Height | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | NObesity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 21.0 | 1.62 | 64.0 | yes | no | 2 | 3 | Sometimes | no | 2 | no | 0 | 1 | no | Public_Transportation | Normal_Weight |
| 1 | Female | 21.0 | 1.52 | 56.0 | yes | no | 3 | 3 | Sometimes | yes | 3 | yes | 3 | 0 | Sometimes | Public_Transportation | Normal_Weight |
| 2 | Male | 23.0 | 1.80 | 77.0 | yes | no | 2 | 3 | Sometimes | no | 2 | no | 2 | 1 | Frequently | Public_Transportation | Normal_Weight |
| 3 | Male | 27.0 | 1.80 | 87.0 | no | no | 3 | 3 | Sometimes | no | 2 | no | 2 | 0 | Frequently | Walking | Overweight_Level_I |
| 4 | Male | 22.0 | 1.78 | 89.8 | no | no | 2 | 1 | Sometimes | no | 2 | no | 0 | 0 | Sometimes | Public_Transportation | Overweight_Level_II |
### Investigate the shape
obdt.shape
(2111, 17)
originalSQL = spark.createDataFrame(obdt)
originalSQL.show(5)
| Gender | Age | Height | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | NObesity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Female | 21.0 | 1.62 | 64.0 | yes | no | 2 | 3 | Sometimes | no | 2 | no | 0 | 1 | no | Public_Transportation | Normal_Weight |
| Female | 21.0 | 1.52 | 56.0 | yes | no | 3 | 3 | Sometimes | yes | 3 | yes | 3 | 0 | Sometimes | Public_Transportation | Normal_Weight |
| Male | 23.0 | 1.8 | 77.0 | yes | no | 2 | 3 | Sometimes | no | 2 | no | 2 | 1 | Frequently | Public_Transportation | Normal_Weight |
| Male | 27.0 | 1.8 | 87.0 | no | no | 3 | 3 | Sometimes | no | 2 | no | 2 | 0 | Frequently | Walking | Overweight_Level_I |
| Male | 22.0 | 1.78 | 89.8 | no | no | 2 | 1 | Sometimes | no | 2 | no | 0 | 0 | Sometimes | Public_Transportation | Overweight_Level_II |

only showing top 5 rows
### split the Original dataset to be training part and test part
train, test = originalSQL.randomSplit([0.8,0.2], seed=1)
print(train.count(), test.count())
train.head()
1687 424
Row(Gender='Female', Age=15.0, Height=1.65, Weight=86.0, family_history_with_overweight='yes', FAVC='yes', FCVC=3, NCP=3, CAEC='Sometimes', SMOKE='no', CH2O=1, SCC='no', FAF=3, TUE=2, CALC='no', MTRANS='Walking', NObesity='Obesity_Type_I')
test.head()
Row(Gender='Female', Age=17.0, Height=1.75, Weight=57.0, family_history_with_overweight='yes', FAVC='yes', FCVC=3, NCP=3, CAEC='Frequently', SMOKE='no', CH2O=2, SCC='no', FAF=0, TUE=1, CALC='no', MTRANS='Public_Transportation', NObesity='Normal_Weight')
EDA is done using the training dataset only.
### convert training dataset to pandas-on-spark
obesitydata = train.to_pandas_on_spark()
### NObesity one-way contingency table
table = obesitydata.NObesity.value_counts(dropna = False)
print(table)
Obesity_Type_I         282
Obesity_Type_III       263
Obesity_Type_II        240
Overweight_Level_II    239
Normal_Weight          235
Overweight_Level_I     221
Insufficient_Weight    207
Name: NObesity, dtype: int64
Obesity_Type_I has the highest frequency (282 of 1687, 16.72%) and Insufficient_Weight has the fewest subjects (207, 12.27%).
def bar_chart(var):
    '''Use the SQL format of the training dataset to create a contingency table of
    the selected variable vs NObesity, then subset the pandas-on-Spark dataframe
    by the selected variable and draw a bar chart.'''
    table1 = train.crosstab(var, "NObesity")  # keep the crosstab; .show() returns None
    table1.show()
    obesitydata[var].plot.bar().show()
    return table1
### Gender vs NObesity
bar_chart("Gender")
| Gender_NObesity | Insufficient_Weight | Normal_Weight | Obesity_Type_I | Obesity_Type_II | Obesity_Type_III | Overweight_Level_I | Overweight_Level_II |
|---|---|---|---|---|---|---|---|
| Male | 70 | 117 | 157 | 238 | 0 | 110 | 154 |
| Female | 137 | 118 | 125 | 2 | 263 | 111 | 85 |
In female group, the biggest category is Obesity_Type_III and Obesity_Type_II has the least subjects; while in male group, the Obesity_Type_II has the most subjects and Obesity_Type_III is the smallest.
### family_history_with_overweight vs NObesity
bar_chart("family_history_with_overweight")
| family_history_with_overweight_NObesity | Insufficient_Weight | Normal_Weight | Obesity_Type_I | Obesity_Type_II | Obesity_Type_III | Overweight_Level_I | Overweight_Level_II |
|---|---|---|---|---|---|---|---|
| no | 111 | 112 | 6 | 1 | 0 | 59 | 18 |
| yes | 96 | 123 | 276 | 239 | 263 | 162 | 221 |
Subjects that have obesity family history are more likely to develop obesity or become overweight; most of insufficient weight subjects do not have family history of obesity.
### FAVC vs NObesity
bar_chart("FAVC")
| FAVC_NObesity | Insufficient_Weight | Normal_Weight | Obesity_Type_I | Obesity_Type_II | Obesity_Type_III | Overweight_Level_I | Overweight_Level_II |
|---|---|---|---|---|---|---|---|
| no | 42 | 69 | 11 | 5 | 1 | 17 | 62 |
| yes | 165 | 166 | 271 | 235 | 262 | 204 | 177 |
Most of the subjects consume high caloric food frequently regardless of their weight levels.
### CAEC vs NObesity
bar_chart("CAEC")
| CAEC_NObesity | Insufficient_Weight | Normal_Weight | Obesity_Type_I | Obesity_Type_II | Obesity_Type_III | Overweight_Level_I | Overweight_Level_II |
|---|---|---|---|---|---|---|---|
| Frequently | 98 | 72 | 6 | 1 | 0 | 10 | 14 |
| Always | 2 | 29 | 3 | 1 | 0 | 4 | 2 |
| no | 3 | 9 | 1 | 1 | 0 | 27 | 1 |
| Sometimes | 104 | 125 | 272 | 237 | 263 | 180 | 222 |
Most subjects in every weight group sometimes consume food between meals; interestingly, the normal and insufficient weight groups tend to eat between meals more frequently than the overweight and obesity groups.
### SMOKE vs NObesity
bar_chart("SMOKE")
| SMOKE_NObesity | Insufficient_Weight | Normal_Weight | Obesity_Type_I | Obesity_Type_II | Obesity_Type_III | Overweight_Level_I | Overweight_Level_II |
|---|---|---|---|---|---|---|---|
| no | 207 | 225 | 277 | 230 | 262 | 218 | 235 |
| yes | 0 | 10 | 5 | 10 | 1 | 3 | 4 |
Only a few subjects in each weight group smoke.
### SCC vs NObesity
bar_chart("SCC")
| SCC_NObesity | Insufficient_Weight | Normal_Weight | Obesity_Type_I | Obesity_Type_II | Obesity_Type_III | Overweight_Level_I | Overweight_Level_II |
|---|---|---|---|---|---|---|---|
| no | 190 | 212 | 280 | 239 | 263 | 193 | 235 |
| yes | 17 | 23 | 2 | 1 | 0 | 28 | 4 |
Only a few subjects in each weight group monitor their calorie intake.
### CALC vs NObesity
bar_chart("CALC")
| CALC_NObesity | Insufficient_Weight | Normal_Weight | Obesity_Type_I | Obesity_Type_II | Obesity_Type_III | Overweight_Level_I | Overweight_Level_II |
|---|---|---|---|---|---|---|---|
| Frequently | 1 | 17 | 12 | 2 | 0 | 11 | 15 |
| no | 91 | 86 | 131 | 55 | 1 | 38 | 108 |
| Sometimes | 115 | 132 | 139 | 183 | 262 | 172 | 116 |
Most subjects drink alcohol occasionally in each weight group.
### MTRANS vs NObesity
bar_chart("MTRANS")
| MTRANS_NObesity | Insufficient_Weight | Normal_Weight | Obesity_Type_I | Obesity_Type_II | Obesity_Type_III | Overweight_Level_I | Overweight_Level_II |
|---|---|---|---|---|---|---|---|
| Bike | 0 | 3 | 0 | 0 | 0 | 1 | 0 |
| Automobile | 30 | 36 | 89 | 77 | 1 | 51 | 73 |
| Walking | 5 | 24 | 2 | 1 | 0 | 8 | 4 |
| Public_Transportation | 172 | 169 | 188 | 162 | 262 | 160 | 161 |
| Motorbike | 0 | 3 | 3 | 0 | 0 | 1 | 1 |
Most subjects use public transportation or automobile in each weight group.
obtp1 = train.filter("NObesity == 'Obesity_Type_I'").to_pandas_on_spark()
obtp2 = train.filter("NObesity == 'Obesity_Type_II'").to_pandas_on_spark()
obtp3 = train.filter("NObesity == 'Obesity_Type_III'").to_pandas_on_spark()
ow1 = train.filter("NObesity == 'Overweight_Level_I'").to_pandas_on_spark()
ow2 = train.filter("NObesity == 'Overweight_Level_II'").to_pandas_on_spark()
nw = train.filter("NObesity == 'Normal_Weight'").to_pandas_on_spark()
iw = train.filter("NObesity == 'Insufficient_Weight'").to_pandas_on_spark()
def kernel_plot(var1):
    '''Produce the selected variable's description table and kernel density plots'''
    sub_g = obesitydata.loc[:, [var1, "NObesity"]]
    sv_des = sub_g.groupby("NObesity").describe()
    # one density plot per NObesity level
    for grp in (obtp1, obtp2, obtp3, ow1, ow2, nw, iw):
        grp[var1].plot.density(bw_method = 0.5).show()
    return sv_des
### Age
kernel_plot("Age")
| NObesity | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Obesity_Type_III | 263.0 | 23.482599 | 2.782070 | 18.112503 | 21.016849 | 25.470652 | 26.000000 | 26.0 |
| Overweight_Level_I | 221.0 | 23.547466 | 6.376175 | 16.000000 | 19.621545 | 21.028500 | 26.000000 | 55.0 |
| Obesity_Type_II | 240.0 | 28.250152 | 4.757921 | 20.000000 | 24.825398 | 27.186873 | 30.684347 | 41.0 |
| Insufficient_Weight | 207.0 | 19.994768 | 2.876419 | 16.000000 | 18.000000 | 19.349258 | 21.491055 | 39.0 |
| Overweight_Level_II | 239.0 | 27.038839 | 8.301512 | 17.000000 | 21.000000 | 23.940030 | 33.000000 | 56.0 |
| Normal_Weight | 235.0 | 21.710638 | 5.033360 | 14.000000 | 19.000000 | 21.000000 | 23.000000 | 61.0 |
| Obesity_Type_I | 282.0 | 25.780438 | 7.674820 | 15.000000 | 20.654752 | 22.997168 | 29.633715 | 52.0 |
Obesity_Type_II group has the largest mean age while Insufficient_Weight group has the smallest; Overweight_Level_II group has the largest standard deviation and Obesity_Type_III group has the smallest.
### Height
kernel_plot("Height")
| NObesity | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Obesity_Type_III | 263.0 | 1.687855 | 0.064231 | 1.560000 | 1.631856 | 1.668931 | 1.746061 | 1.827730 |
| Overweight_Level_I | 221.0 | 1.686334 | 0.096284 | 1.456346 | 1.616533 | 1.679725 | 1.756774 | 1.900000 |
| Obesity_Type_II | 240.0 | 1.772933 | 0.072917 | 1.600000 | 1.750000 | 1.770278 | 1.824901 | 1.918859 |
| Insufficient_Weight | 207.0 | 1.688874 | 0.098149 | 1.520000 | 1.600000 | 1.700000 | 1.756330 | 1.900000 |
| Overweight_Level_II | 239.0 | 1.701861 | 0.089025 | 1.480000 | 1.663178 | 1.700740 | 1.750097 | 1.930000 |
| Normal_Weight | 235.0 | 1.672809 | 0.094975 | 1.500000 | 1.600000 | 1.660000 | 1.740000 | 1.930000 |
| Obesity_Type_I | 282.0 | 1.696785 | 0.098917 | 1.500000 | 1.620930 | 1.683000 | 1.781251 | 1.980000 |
Obesity_Type_II group has the largest mean height while Normal_Weight group has the smallest; Obesity_Type_I group has the largest standard deviation and Obesity_Type_III group has the smallest.
### Weight
kernel_plot("Weight")
| NObesity | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Obesity_Type_III | 263.0 | 120.884813 | 15.191570 | 102.000000 | 109.959714 | 112.098616 | 133.644711 | 160.935351 |
| Overweight_Level_I | 221.0 | 74.123851 | 8.434043 | 53.620604 | 68.066090 | 74.959747 | 80.000000 | 91.000000 |
| Obesity_Type_II | 240.0 | 115.324530 | 8.046202 | 93.000000 | 112.007101 | 117.757010 | 120.794535 | 129.991623 |
| Insufficient_Weight | 207.0 | 49.828436 | 5.814822 | 39.000000 | 44.810751 | 50.000000 | 52.514302 | 65.000000 |
| Overweight_Level_II | 239.0 | 81.872115 | 8.255836 | 60.000000 | 78.008388 | 81.322970 | 86.080500 | 102.000000 |
| Normal_Weight | 235.0 | 62.205532 | 9.482981 | 44.000000 | 55.000000 | 61.000000 | 69.500000 | 87.000000 |
| Obesity_Type_I | 282.0 | 93.118069 | 11.522536 | 75.000000 | 82.193405 | 90.924208 | 104.970030 | 125.000000 |
As expected, Obesity_Type_III group has the largest mean weight and the largest standard deviation, while Insufficient_Weight group has the smallest mean and the smallest standard deviation.
### CH2O
kernel_plot("CH2O")
| NObesity | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Obesity_Type_III | 263.0 | 2.285171 | 0.745501 | 1.0 | 2.0 | 2.0 | 3.0 | 3.0 |
| Overweight_Level_I | 221.0 | 2.076923 | 0.699650 | 1.0 | 2.0 | 2.0 | 3.0 | 3.0 |
| Obesity_Type_II | 240.0 | 1.862500 | 0.601471 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 |
| Insufficient_Weight | 207.0 | 1.874396 | 0.678141 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 |
| Overweight_Level_II | 239.0 | 2.025105 | 0.586434 | 1.0 | 2.0 | 2.0 | 2.0 | 3.0 |
| Normal_Weight | 235.0 | 1.842553 | 0.637753 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 |
| Obesity_Type_I | 282.0 | 2.109929 | 0.719807 | 1.0 | 2.0 | 2.0 | 3.0 | 3.0 |
Obesity_Type_III group has the largest mean daily water consumption and the largest standard deviation; Normal_Weight group has the smallest mean and Overweight_Level_II group has the smallest standard deviation.
### FAF
kernel_plot("FAF")
| NObesity | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Obesity_Type_III | 263.0 | 0.642586 | 0.815880 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| Overweight_Level_I | 221.0 | 1.049774 | 0.890486 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 |
| Obesity_Type_II | 240.0 | 0.983333 | 0.659445 | 0.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| Insufficient_Weight | 207.0 | 1.217391 | 0.890122 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 |
| Overweight_Level_II | 239.0 | 0.979079 | 0.881403 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 |
| Normal_Weight | 235.0 | 1.302128 | 1.007555 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 |
| Obesity_Type_I | 282.0 | 1.000000 | 0.939448 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 |
Normal_Weight group has the highest mean and the largest standard deviation of physical activity frequency; Obesity_Type_III group has the smallest mean and Obesity_Type_II group has the smallest standard deviation.
### TUE
kernel_plot("TUE")
| NObesity | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Obesity_Type_III | 263.0 | 0.665399 | 0.472750 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| Overweight_Level_I | 221.0 | 0.583710 | 0.749853 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| Obesity_Type_II | 240.0 | 0.475000 | 0.633282 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| Insufficient_Weight | 207.0 | 0.821256 | 0.725361 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 |
| Overweight_Level_II | 239.0 | 0.715481 | 0.637263 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 |
| Normal_Weight | 235.0 | 0.668085 | 0.685973 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 |
| Obesity_Type_I | 282.0 | 0.680851 | 0.748133 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 |
Insufficient_Weight group has the largest mean time using technology devices while Obesity_Type_II group has the smallest; Overweight_Level_I group has the largest standard deviation and Obesity_Type_III group has the smallest.
sub_2= obesitydata.loc[:,["FCVC","NCP","NObesity"]]
sv_des2 = sub_2.groupby("NObesity").describe()
sv_des2
| NObesity | FCVC count | FCVC mean | FCVC std | FCVC min | FCVC 25% | FCVC 50% | FCVC 75% | FCVC max | NCP count | NCP mean | NCP std | NCP min | NCP 25% | NCP 50% | NCP 75% | NCP max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Obesity_Type_III | 263.0 | 3.000000 | 0.000000 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 263.0 | 3.000000 | 0.000000 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| Overweight_Level_I | 221.0 | 2.262443 | 0.525609 | 1.0 | 2.0 | 2.0 | 3.0 | 3.0 | 221.0 | 2.452489 | 1.006233 | 1.0 | 1.0 | 3.0 | 3.0 | 4.0 |
| Obesity_Type_II | 240.0 | 2.375000 | 0.621276 | 1.0 | 2.0 | 2.0 | 3.0 | 3.0 | 240.0 | 2.770833 | 0.621766 | 1.0 | 3.0 | 3.0 | 3.0 | 4.0 |
| Insufficient_Weight | 207.0 | 2.545894 | 0.628274 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 | 207.0 | 2.908213 | 0.932832 | 1.0 | 3.0 | 3.0 | 4.0 | 4.0 |
| Overweight_Level_II | 239.0 | 2.280335 | 0.511291 | 1.0 | 2.0 | 2.0 | 3.0 | 3.0 | 239.0 | 2.506276 | 0.793197 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 |
| Normal_Weight | 235.0 | 2.319149 | 0.581274 | 1.0 | 2.0 | 2.0 | 3.0 | 3.0 | 235.0 | 2.731915 | 0.901354 | 1.0 | 3.0 | 3.0 | 3.0 | 4.0 |
| Obesity_Type_I | 282.0 | 2.191489 | 0.504991 | 1.0 | 2.0 | 2.0 | 2.0 | 3.0 | 282.0 | 2.450355 | 0.826184 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 |
Obesity_Type_III group has the largest mean and zero standard deviation for both vegetable consumption and number of main meals. Obesity_Type_I group has the smallest mean vegetable consumption and Insufficient_Weight group has the largest standard deviation. Obesity_Type_I group also has the lowest mean number of main meals, and among the remaining groups Obesity_Type_II has the smallest standard deviation.
### Correlation Matrix
indep = obesitydata[["Age","Height","Weight","FCVC","NCP","CH2O","FAF","TUE"]]
indep.corr(method='pearson')
| Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE | |
|---|---|---|---|---|---|---|---|---|
| Age | 1.000000 | -0.010308 | 0.193997 | 0.004209 | -0.033733 | -0.026532 | -0.133409 | -0.279994 |
| Height | -0.010308 | 1.000000 | 0.478241 | -0.065864 | 0.243900 | 0.191648 | 0.298308 | 0.057219 |
| Weight | 0.193997 | 0.478241 | 1.000000 | 0.172750 | 0.121563 | 0.191209 | -0.042681 | -0.036470 |
| FCVC | 0.004209 | -0.065864 | 0.172750 | 1.000000 | 0.013461 | 0.072814 | 0.014576 | -0.046870 |
| NCP | -0.033733 | 0.243900 | 0.121563 | 0.013461 | 1.000000 | 0.057044 | 0.127579 | 0.007357 |
| CH2O | -0.026532 | 0.191648 | 0.191209 | 0.072814 | 0.057044 | 1.000000 | 0.127741 | -0.039267 |
| FAF | -0.133409 | 0.298308 | -0.042681 | 0.014576 | 0.127579 | 0.127741 | 1.000000 | 0.067049 |
| TUE | -0.279994 | 0.057219 | -0.036470 | -0.046870 | 0.007357 | -0.039267 | 0.067049 | 1.000000 |
Height and Weight have the strongest correlation among all the numerical predictors. We generate scatter plots for the pairs with an absolute correlation coefficient greater than 0.2.
ps.options.plotting.backend = 'matplotlib'
def scatter_plot(varx, vary):
    '''Generate a scatter plot for the selected x and y variables'''
    indep.plot.scatter(x=varx, y=vary)
    plt.title("The Scatter plot of " + varx + " vs " + vary)
    plt.xlabel(varx)
    plt.ylabel(vary)
    plt.show()
scatter_plot("Age", "TUE")
scatter_plot("Height", "Weight")
scatter_plot("Height", "NCP")
scatter_plot("Height", "FAF")
### Convert training SQL to pandas
obdtr = train.toPandas()
gender = pd.get_dummies(obdtr.Gender, prefix="Gender")
family = pd.get_dummies(obdtr.family_history_with_overweight, prefix="family_history_with_overweight")
favc = pd.get_dummies(obdtr.FAVC, prefix="FAVC")
caec = pd.get_dummies(obdtr.CAEC, prefix="CAEC")
smoke = pd.get_dummies(obdtr.SMOKE, prefix="SMOKE")
scc = pd.get_dummies(obdtr.SCC, prefix="SCC")
calc = pd.get_dummies(obdtr.CALC, prefix="CALC")
mtrans = pd.get_dummies(obdtr.MTRANS, prefix="MTRANS")
df = obdtr.drop(["Gender", "family_history_with_overweight","FAVC", "CAEC", "SMOKE", "SCC","CALC", "MTRANS"], axis = 1)
df = df.join(gender).join(family).join(favc).join(caec).join(smoke).join(scc).join(calc).join(mtrans)
### NObesity to a binary variable
def obs(x):
    if x == 'Insufficient_Weight': return 0
    if x == 'Normal_Weight': return 0
    if x == 'Overweight_Level_I': return 0
    if x == 'Overweight_Level_II': return 0
    if x == 'Obesity_Type_I': return 1
    if x == 'Obesity_Type_II': return 1
    if x == 'Obesity_Type_III': return 1
obsYes = df['NObesity'].apply(obs)
df = df.drop(["NObesity"], axis = 1)
df = df.join(obsYes)
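As a quick sanity check, the recoding above is equivalent to a set-membership test (a sketch in plain Python; `obs_binary` is a hypothetical alternative name, not used elsewhere in this notebook):

```python
# The three obesity types map to 1; every other NObesity level maps to 0
OBESE_LEVELS = {'Obesity_Type_I', 'Obesity_Type_II', 'Obesity_Type_III'}

def obs_binary(x):
    """1 for the three obesity types, 0 for all other NObesity levels."""
    return 1 if x in OBESE_LEVELS else 0

print(obs_binary('Obesity_Type_II'))  # -> 1
print(obs_binary('Normal_Weight'))    # -> 0
```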
df.columns
Index(['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE',
'Gender_Female', 'Gender_Male', 'family_history_with_overweight_no',
'family_history_with_overweight_yes', 'FAVC_no', 'FAVC_yes',
'CAEC_Always', 'CAEC_Frequently', 'CAEC_Sometimes', 'CAEC_no',
'SMOKE_no', 'SMOKE_yes', 'SCC_no', 'SCC_yes', 'CALC_Frequently',
'CALC_Sometimes', 'CALC_no', 'MTRANS_Automobile', 'MTRANS_Bike',
'MTRANS_Motorbike', 'MTRANS_Public_Transportation', 'MTRANS_Walking',
'NObesity'],
dtype='object')
df.head()
| | Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE | Gender_Female | Gender_Male | ... | SCC_yes | CALC_Frequently | CALC_Sometimes | CALC_no | MTRANS_Automobile | MTRANS_Bike | MTRANS_Motorbike | MTRANS_Public_Transportation | MTRANS_Walking | NObesity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.0 | 1.65 | 86.0 | 3 | 3 | 1 | 3 | 2 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 17.0 | 1.63 | 65.0 | 2 | 1 | 3 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 17.0 | 1.65 | 67.0 | 3 | 1 | 2 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 17.0 | 1.70 | 85.0 | 2 | 3 | 2 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 18.0 | 1.56 | 51.0 | 2 | 4 | 2 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
5 rows × 31 columns
trainR = spark.createDataFrame(df)
trainR.show(5)
| Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE | Gender_Female | Gender_Male | family_history_with_overweight_no | family_history_with_overweight_yes | FAVC_no | FAVC_yes | CAEC_Always | CAEC_Frequently | CAEC_Sometimes | CAEC_no | SMOKE_no | SMOKE_yes | SCC_no | SCC_yes | CALC_Frequently | CALC_Sometimes | CALC_no | MTRANS_Automobile | MTRANS_Bike | MTRANS_Motorbike | MTRANS_Public_Transportation | MTRANS_Walking | NObesity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15.0 | 1.65 | 86.0 | 3 | 3 | 1 | 3 | 2 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 17.0 | 1.63 | 65.0 | 2 | 1 | 3 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 17.0 | 1.65 | 67.0 | 3 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 17.0 | 1.7 | 85.0 | 2 | 3 | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 18.0 | 1.56 | 51.0 | 2 | 4 | 2 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |

only showing top 5 rows
sqlTrans = SQLTransformer(
statement = "SELECT Age, FCVC, NCP, CH2O, FAF, TUE, Gender_Female, Gender_Male, family_history_with_overweight_no,\
family_history_with_overweight_yes, FAVC_no, FAVC_yes, CAEC_Always, CAEC_Frequently, CAEC_Sometimes, CAEC_no, SMOKE_no, SMOKE_yes, \
SCC_no, SCC_yes, CALC_Frequently, CALC_Sometimes, CALC_no, MTRANS_Automobile, MTRANS_Bike, MTRANS_Motorbike, MTRANS_Public_Transportation,\
MTRANS_Walking, NObesity as label FROM __THIS__"
)
sqlTrans.transform(trainR).show(5)
+----+----+---+----+---+---+-------------+-----------+---------------------------------+----------------------------------+-------+--------+-----------+---------------+--------------+-------+--------+---------+------+-------+---------------+--------------+-------+-----------------+-----------+----------------+----------------------------+--------------+-----+
| Age|FCVC|NCP|CH2O|FAF|TUE|Gender_Female|Gender_Male|family_history_with_overweight_no|family_history_with_overweight_yes|FAVC_no|FAVC_yes|CAEC_Always|CAEC_Frequently|CAEC_Sometimes|CAEC_no|SMOKE_no|SMOKE_yes|SCC_no|SCC_yes|CALC_Frequently|CALC_Sometimes|CALC_no|MTRANS_Automobile|MTRANS_Bike|MTRANS_Motorbike|MTRANS_Public_Transportation|MTRANS_Walking|label|
+----+----+---+----+---+---+-------------+-----------+---------------------------------+----------------------------------+-------+--------+-----------+---------------+--------------+-------+--------+---------+------+-------+---------------+--------------+-------+-----------------+-----------+----------------+----------------------------+--------------+-----+
|15.0| 3| 3| 1| 3| 2| 1| 0| 0| 1| 0| 1| 0| 0| 1| 0| 1| 0| 1| 0| 0| 0| 1| 0| 0| 0| 0| 1| 1|
|17.0| 2| 1| 3| 1| 1| 1| 0| 1| 0| 0| 1| 0| 0| 1| 0| 1| 0| 1| 0| 0| 0| 1| 0| 0| 0| 1| 0| 0|
|17.0| 3| 1| 2| 1| 1| 1| 0| 0| 1| 0| 1| 0| 0| 1| 0| 1| 0| 1| 0| 0| 0| 1| 0| 0| 0| 0| 1| 0|
|17.0| 2| 3| 2| 1| 1| 1| 0| 0| 1| 1| 0| 0| 1| 0| 0| 1| 0| 1| 0| 0| 0| 1| 0| 0| 0| 1| 0| 0|
|18.0| 2| 4| 2| 1| 0| 1| 0| 0| 1| 0| 1| 0| 1| 0| 0| 1| 0| 1| 0| 0| 1| 0| 0| 0| 0| 1| 0| 0|
+----+----+---+----+---+---+-------------+-----------+---------------------------------+----------------------------------+-------+--------+-----------+---------------+--------------+-------+--------+---------+------+-------+---------------+--------------+-------+-----------------+-----------+----------------+----------------------------+--------------+-----+
only showing top 5 rows
Features using the VectorAssembler function
assembler = VectorAssembler(inputCols = ["Age", "FCVC", "NCP", "CH2O", "FAF", "TUE", "Gender_Female", "Gender_Male",
"family_history_with_overweight_no", "family_history_with_overweight_yes", "FAVC_no",
"FAVC_yes", "CAEC_Always", "CAEC_Frequently", "CAEC_Sometimes", "CAEC_no", "SMOKE_no",
"SMOKE_yes", "SCC_no", "SCC_yes", "CALC_Frequently", "CALC_Sometimes", "CALC_no", "MTRANS_Automobile",
"MTRANS_Bike", "MTRANS_Motorbike", "MTRANS_Public_Transportation", "MTRANS_Walking"],
outputCol = "features", handleInvalid = 'keep')
assembler.transform(
sqlTrans.transform(trainR)
).select("label", "features").show(5)
+-----+--------------------+
|label|            features|
+-----+--------------------+
|    1|(28,[0,1,2,3,4,5,...|
|    0|(28,[0,1,2,3,4,5,...|
|    0|(28,[0,1,2,3,4,5,...|
|    0|(28,[0,1,2,3,4,5,...|
|    0|(28,[0,1,2,3,4,6,...|
+-----+--------------------+
only showing top 5 rows
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
pipeline = Pipeline(stages = [sqlTrans, assembler, lr])
ParamGridBuilder and addGrid functions to specify the tuning-parameter values and build the grid
paramGrid = ParamGridBuilder() \
.addGrid(lr.regParam, [0]) \
.addGrid(lr.fitIntercept, [False, True]) \
.addGrid(lr.elasticNetParam, [0]) \
.build()
CrossValidator function to run 5-fold CV with the tuning-parameter values and grid set up in the previous step
crossval = CrossValidator(estimator = pipeline,
estimatorParamMaps = paramGrid,
evaluator = MulticlassClassificationEvaluator(metricName="accuracy"),
numFolds=5)
cvModel = crossval.fit(trainR)
avgMetrics attribute and the paramGrid object
list(zip(cvModel.avgMetrics, paramGrid))
[(0.7689267516211014,
{Param(parent='LogisticRegression_dad0d9942795', name='regParam', doc='regularization parameter (>= 0).'): 0.0,
Param(parent='LogisticRegression_dad0d9942795', name='fitIntercept', doc='whether to fit an intercept term.'): False,
Param(parent='LogisticRegression_dad0d9942795', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0}),
(0.7723689646924204,
{Param(parent='LogisticRegression_dad0d9942795', name='regParam', doc='regularization parameter (>= 0).'): 0.0,
Param(parent='LogisticRegression_dad0d9942795', name='fitIntercept', doc='whether to fit an intercept term.'): True,
Param(parent='LogisticRegression_dad0d9942795', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0})]
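Since avgMetrics is aligned index-by-index with the paramGrid entries, the best configuration is simply the argmax over the zipped pairs. A plain-Python sketch of that ranking step (no Spark needed; the dicts below are simplified stand-ins for the real Param maps):

```python
# avgMetrics values copied from the CV output above; param_maps are
# simplified stand-ins for the Param-map dicts in paramGrid.
avg_metrics = [0.7689267516211014, 0.7723689646924204]
param_maps = [{"fitIntercept": False}, {"fitIntercept": True}]

# Pick the (metric, params) pair with the highest mean CV accuracy.
best_score, best_params = max(zip(avg_metrics, param_maps), key=lambda pair: pair[0])
print(best_score, best_params)  # 0.7723689646924204 {'fitIntercept': True}
```

Here fitting an intercept wins by a small margin, which matches the listing above.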
sqlTransN = SQLTransformer(
statement = "SELECT Age, FCVC, NCP, CH2O, FAF, TUE, NObesity as label FROM __THIS__"
)
sqlTransN.transform(trainR).show(5)
+----+----+---+----+---+---+-----+
| Age|FCVC|NCP|CH2O|FAF|TUE|label|
+----+----+---+----+---+---+-----+
|15.0|   3|  3|   1|  3|  2|    1|
|17.0|   2|  1|   3|  1|  1|    0|
|17.0|   3|  1|   2|  1|  1|    0|
|17.0|   2|  3|   2|  1|  1|    0|
|18.0|   2|  4|   2|  1|  0|    0|
+----+----+---+----+---+---+-----+
only showing top 5 rows
Features using the VectorAssembler function
assemblerN = VectorAssembler(inputCols = ["Age", "FCVC", "NCP", "CH2O", "FAF", "TUE"], outputCol = "features", handleInvalid = 'keep')
assemblerN.transform(
sqlTransN.transform(trainR)
).select("label", "features").show(5)
+-----+--------------------+
|label|            features|
+-----+--------------------+
|    1|[15.0,3.0,3.0,1.0...|
|    0|[17.0,2.0,1.0,3.0...|
|    0|[17.0,3.0,1.0,2.0...|
|    0|[17.0,2.0,3.0,2.0...|
|    0|[18.0,2.0,4.0,2.0...|
+-----+--------------------+
only showing top 5 rows
pipelineN = Pipeline(stages = [sqlTransN, assemblerN, lr])
CrossValidator function to run 5-fold CV with the tuning-parameter values and grid set up for the previous model
crossvalN = CrossValidator(estimator = pipelineN,
estimatorParamMaps = paramGrid,
evaluator = MulticlassClassificationEvaluator(metricName="accuracy"),
numFolds=5)
cvModelN = crossvalN.fit(trainR)
avgMetrics attribute and the paramGrid object
list(zip(cvModelN.avgMetrics, paramGrid))
[(0.5653649354224015,
{Param(parent='LogisticRegression_dad0d9942795', name='regParam', doc='regularization parameter (>= 0).'): 0.0,
Param(parent='LogisticRegression_dad0d9942795', name='fitIntercept', doc='whether to fit an intercept term.'): False,
Param(parent='LogisticRegression_dad0d9942795', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0}),
(0.6329532841652146,
{Param(parent='LogisticRegression_dad0d9942795', name='regParam', doc='regularization parameter (>= 0).'): 0.0,
Param(parent='LogisticRegression_dad0d9942795', name='fitIntercept', doc='whether to fit an intercept term.'): True,
Param(parent='LogisticRegression_dad0d9942795', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0})]
Categorical predictors and NObesity as label using the SQLTransformer function
sqlTransC = SQLTransformer(
statement = "SELECT Gender_Female, Gender_Male, family_history_with_overweight_no,\
family_history_with_overweight_yes, FAVC_no, FAVC_yes, CAEC_Always, CAEC_Frequently, CAEC_Sometimes, CAEC_no, SMOKE_no, SMOKE_yes, \
SCC_no, SCC_yes, CALC_Frequently, CALC_Sometimes, CALC_no, MTRANS_Automobile, MTRANS_Bike, MTRANS_Motorbike, MTRANS_Public_Transportation,\
MTRANS_Walking, NObesity as label FROM __THIS__"
)
sqlTransC.transform(trainR).show(5)
+-------------+-----------+---------------------------------+----------------------------------+-------+--------+-----------+---------------+--------------+-------+--------+---------+------+-------+---------------+--------------+-------+-----------------+-----------+----------------+----------------------------+--------------+-----+
|Gender_Female|Gender_Male|family_history_with_overweight_no|family_history_with_overweight_yes|FAVC_no|FAVC_yes|CAEC_Always|CAEC_Frequently|CAEC_Sometimes|CAEC_no|SMOKE_no|SMOKE_yes|SCC_no|SCC_yes|CALC_Frequently|CALC_Sometimes|CALC_no|MTRANS_Automobile|MTRANS_Bike|MTRANS_Motorbike|MTRANS_Public_Transportation|MTRANS_Walking|label|
+-------------+-----------+---------------------------------+----------------------------------+-------+--------+-----------+---------------+--------------+-------+--------+---------+------+-------+---------------+--------------+-------+-----------------+-----------+----------------+----------------------------+--------------+-----+
| 1| 0| 0| 1| 0| 1| 0| 0| 1| 0| 1| 0| 1| 0| 0| 0| 1| 0| 0| 0| 0| 1| 1|
| 1| 0| 1| 0| 0| 1| 0| 0| 1| 0| 1| 0| 1| 0| 0| 0| 1| 0| 0| 0| 1| 0| 0|
| 1| 0| 0| 1| 0| 1| 0| 0| 1| 0| 1| 0| 1| 0| 0| 0| 1| 0| 0| 0| 0| 1| 0|
| 1| 0| 0| 1| 1| 0| 0| 1| 0| 0| 1| 0| 1| 0| 0| 0| 1| 0| 0| 0| 1| 0| 0|
| 1| 0| 0| 1| 0| 1| 0| 1| 0| 0| 1| 0| 1| 0| 0| 1| 0| 0| 0| 0| 1| 0| 0|
+-------------+-----------+---------------------------------+----------------------------------+-------+--------+-----------+---------------+--------------+-------+--------+---------+------+-------+---------------+--------------+-------+-----------------+-----------+----------------+----------------------------+--------------+-----+
only showing top 5 rows
Features using the VectorAssembler function
assemblerC = VectorAssembler(inputCols = ["Gender_Female", "Gender_Male", "family_history_with_overweight_no",
"family_history_with_overweight_yes", "FAVC_no", "FAVC_yes", "CAEC_Always",
"CAEC_Frequently", "CAEC_Sometimes", "CAEC_no", "SMOKE_no", "SMOKE_yes", "SCC_no",
"SCC_yes", "CALC_Frequently", "CALC_Sometimes", "CALC_no", "MTRANS_Automobile",
"MTRANS_Bike", "MTRANS_Motorbike", "MTRANS_Public_Transportation", "MTRANS_Walking"],
outputCol = "features", handleInvalid = 'keep')
assemblerC.transform(
sqlTransC.transform(trainR)
).select("label", "features").show(5)
+-----+--------------------+
|label|            features|
+-----+--------------------+
|    1|(22,[0,3,5,8,10,1...|
|    0|(22,[0,2,5,8,10,1...|
|    0|(22,[0,3,5,8,10,1...|
|    0|(22,[0,3,4,7,10,1...|
|    0|(22,[0,3,5,7,10,1...|
+-----+--------------------+
only showing top 5 rows
pipelineC = Pipeline(stages = [sqlTransC, assemblerC, lr])
CrossValidator function to run 5-fold CV with the tuning-parameter values and grid set up for the previous model
crossvalC = CrossValidator(estimator = pipelineC,
estimatorParamMaps = paramGrid,
evaluator = MulticlassClassificationEvaluator(metricName="accuracy"),
numFolds=5)
cvModelC = crossvalC.fit(trainR)
avgMetrics attribute and the paramGrid object
list(zip(cvModelC.avgMetrics, paramGrid))
[(0.781304748023267,
{Param(parent='LogisticRegression_dad0d9942795', name='regParam', doc='regularization parameter (>= 0).'): 0.0,
Param(parent='LogisticRegression_dad0d9942795', name='fitIntercept', doc='whether to fit an intercept term.'): False,
Param(parent='LogisticRegression_dad0d9942795', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0}),
(0.781259163977683,
{Param(parent='LogisticRegression_dad0d9942795', name='regParam', doc='regularization parameter (>= 0).'): 0.0,
Param(parent='LogisticRegression_dad0d9942795', name='fitIntercept', doc='whether to fit an intercept term.'): True,
Param(parent='LogisticRegression_dad0d9942795', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0})]
Interaction is a Transformer which takes vector or double-valued columns, and generates a single vector column that contains the product of all combinations of one value from each input column.
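As a plain-Python illustration of what Interaction computes for one row (the `interact` helper below is hypothetical, not a Spark API): scalars are treated as length-1 vectors, and the output is the product of every combination of one value taken from each input column.

```python
from itertools import product
from math import prod

def interact(*cols):
    # Treat plain numbers as length-1 vectors, then multiply one value
    # from each column, over all combinations.
    vecs = [c if isinstance(c, (list, tuple)) else [c] for c in cols]
    return [prod(combo) for combo in product(*vecs)]

print(interact(2.0, [1.0, 4.0]))  # [2.0, 8.0]
print(interact([1, 2], [3, 4]))   # [3, 4, 6, 8]
```

For our six scalar numerical predictors the output is a single vector whose entries multiply one value from each column, i.e. the six-way product per row.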
We reuse the SQLTransformer step set up in the numerical logistic regression, which keeps only the numerical predictors and puts the response variable into the label column.
Interaction function to get interactions of all the numerical predictors
interaction = Interaction(inputCols = ["Age", "FCVC", "NCP", "CH2O", "FAF", "TUE"], outputCol = "features")
pipelineI = Pipeline(stages = [sqlTransN, interaction, lr])
CrossValidator function to run 5-fold CV with the tuning-parameter values and grid set up for the previous model
crossvalI = CrossValidator(estimator = pipelineI,
estimatorParamMaps = paramGrid,
evaluator = MulticlassClassificationEvaluator(metricName="accuracy"),
numFolds=5)
cvModelI = crossvalI.fit(trainR)
avgMetrics attribute and the paramGrid object
list(zip(cvModelI.avgMetrics, paramGrid))
[(0.5349141606918814,
{Param(parent='LogisticRegression_dad0d9942795', name='regParam', doc='regularization parameter (>= 0).'): 0.0,
Param(parent='LogisticRegression_dad0d9942795', name='fitIntercept', doc='whether to fit an intercept term.'): False,
Param(parent='LogisticRegression_dad0d9942795', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0}),
(0.5337168564855137,
{Param(parent='LogisticRegression_dad0d9942795', name='regParam', doc='regularization parameter (>= 0).'): 0.0,
Param(parent='LogisticRegression_dad0d9942795', name='fitIntercept', doc='whether to fit an intercept term.'): True,
Param(parent='LogisticRegression_dad0d9942795', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0})]
PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
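As a library-free sketch of that idea (an illustration, not MLlib's implementation): standardize each column, form the covariance matrix, and recover the leading principal direction by power iteration. The two-variable helper below and its toy data are made up for the example.

```python
def standardize(xs):
    # Center and scale one column (sample standard deviation, n-1).
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in xs]

def leading_component(x, y, iters=100):
    # 2x2 covariance matrix of two standardized columns.
    n = len(x)
    cxx = sum(a * a for a in x) / (n - 1)
    cyy = sum(b * b for b in y) / (n - 1)
    cxy = sum(a * b for a, b in zip(x, y)) / (n - 1)
    # Power iteration: repeatedly apply the matrix and renormalize.
    v = (1.0, 0.0)
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    # Variance captured along v (the leading eigenvalue).
    eig = cxx * v[0] ** 2 + 2 * cxy * v[0] * v[1] + cyy * v[1] ** 2
    return v, eig

x = standardize([1.0, 2.0, 3.0, 4.0])
y = standardize([2.1, 3.9, 6.2, 8.0])  # nearly proportional to x
v, eig = leading_component(x, y)
# eig is close to 2.0: one component captures almost all the variance
```

Because the two toy columns are almost perfectly correlated, the first principal component alone explains nearly all of the (standardized) variance, which is exactly why PCA with k < p can be a useful dimension reduction before the logistic regression stage.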
We reuse the SQLTransformer step set up in the numerical logistic regression, which keeps only the numerical predictors and puts the response variable into the label column.
VectorAssembler to place the numerical predictors into a vectors column
assemblerP = VectorAssembler(inputCols = ["Age", "FCVC", "NCP", "CH2O", "FAF", "TUE"], outputCol = "vectors", handleInvalid = 'keep')
StandardScaler function to standardize the vectors and output them into scaledFeatures
scaler = StandardScaler(
inputCol = 'vectors',
outputCol = 'scaledFeatures',
withMean = True,
withStd = True
)
pca = PCA(
k = 3,
inputCol = 'scaledFeatures',
outputCol = 'features'
)
pipelineP = Pipeline(stages = [sqlTransN, assemblerP, scaler, pca, lr])
CrossValidator function to run 5-fold CV with the tuning-parameter values and grid set up for the previous model
crossvalP = CrossValidator(estimator = pipelineP,
estimatorParamMaps = paramGrid,
evaluator = MulticlassClassificationEvaluator(metricName="accuracy"),
numFolds=5)
cvModelP = crossvalP.fit(trainR)
avgMetrics attribute and the paramGrid object
list(zip(cvModelP.avgMetrics, paramGrid))
[(0.5753834173796891,
{Param(parent='LogisticRegression_dad0d9942795', name='regParam', doc='regularization parameter (>= 0).'): 0.0,
Param(parent='LogisticRegression_dad0d9942795', name='fitIntercept', doc='whether to fit an intercept term.'): False,
Param(parent='LogisticRegression_dad0d9942795', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0}),
(0.575804211168356,
{Param(parent='LogisticRegression_dad0d9942795', name='regParam', doc='regularization parameter (>= 0).'): 0.0,
Param(parent='LogisticRegression_dad0d9942795', name='fitIntercept', doc='whether to fit an intercept term.'): True,
Param(parent='LogisticRegression_dad0d9942795', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0})]
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
We specify maxCategories = 2 so that features with more than 2 distinct values are treated as continuous, while the binary dummy columns are indexed as categorical.
featureIndexer = VectorIndexer(inputCol="features",
outputCol="indexedFeatures", maxCategories=2)
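A plain-Python sketch of that heuristic (illustrative only, with a hypothetical helper and toy columns, not VectorIndexer's actual code): count distinct values per feature and treat those with at most maxCategories values as categorical.

```python
def split_by_max_categories(columns, max_categories=2):
    # Features with <= max_categories distinct values are categorical;
    # everything else is left continuous.
    categorical, continuous = [], []
    for name, values in columns.items():
        (categorical if len(set(values)) <= max_categories else continuous).append(name)
    return categorical, continuous

cols = {
    "Age": [15.0, 17.0, 17.0, 18.0],  # many distinct values -> continuous
    "FAVC_yes": [1, 1, 1, 0],         # 0/1 dummy -> categorical
}
cat, cont = split_by_max_categories(cols)
print(cat, cont)  # ['FAVC_yes'] ['Age']
```

With maxCategories=2, only the 0/1 dummy columns produced by one-hot encoding get categorical treatment in the tree models.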
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
pipelineT = Pipeline(stages = [sqlTrans, assembler, labelIndexer, featureIndexer, dt])
ParamGrid for cross-validation
dtparamGrid = (ParamGridBuilder()
.addGrid(dt.maxDepth, [2, 5, 10, 20, 30])
.addGrid(dt.maxBins, [10, 20, 40, 80, 100])
.build())
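ParamGridBuilder expands each addGrid list into a full cartesian product, so this grid yields 5 × 5 = 25 candidate parameter maps, and 5-fold CV fits each candidate once per fold. A plain-Python sketch of the expansion:

```python
from itertools import product

# The same grid values as above, expanded the way ParamGridBuilder does.
max_depths = [2, 5, 10, 20, 30]
max_bins = [10, 20, 40, 80, 100]
grid = [{"maxDepth": d, "maxBins": b} for d, b in product(max_depths, max_bins)]

print(len(grid))      # 25 candidate parameter maps
print(len(grid) * 5)  # 125 tree fits under 5-fold CV
```

This is worth keeping in mind when sizing grids: the fit count grows multiplicatively with each addGrid call.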
crossvalT = CrossValidator(estimator = pipelineT,
estimatorParamMaps = dtparamGrid,
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy"),
numFolds=5)
cvModelT = crossvalT.fit(trainR)
avgMetrics attribute and the dtparamGrid object
list(zip(cvModelT.avgMetrics, dtparamGrid))
[(0.7117234379256621,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 2,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 10}),
(0.7117234379256621,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 2,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 20}),
(0.7117234379256621,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 2,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 40}),
(0.7117234379256621,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 2,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 80}),
(0.7117234379256621,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 2,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 100}),
(0.7675415573786724,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 10}),
(0.7714698848084462,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 20}),
(0.7834129047514662,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 40}),
(0.7828086751442154,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 80}),
(0.7816309718388526,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 100}),
(0.8571550996481252,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 10,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 10}),
(0.8591498534058154,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 10,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 20}),
(0.8525622660522706,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 10,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 40}),
(0.8610007800165285,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 10,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 80}),
(0.8632538635279524,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 10,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 100}),
(0.8799541091278574,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 20,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 10}),
(0.8765695991233049,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 20,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 20}),
(0.8801605477293593,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 20,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 40}),
(0.879441876001624,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 20,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 80}),
(0.8716614859598726,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 20,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 100}),
(0.8787233398970881,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 30,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 10}),
(0.8765695991233049,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 30,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 20}),
(0.8807303482991597,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 30,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 40}),
(0.879441876001624,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 30,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 80}),
(0.8722312865296731,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 30,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 100})]
As before, we put NObesity into the label column and all the predictors into features, and reuse the indexing steps set up for the classification-tree model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)
pipelineR = Pipeline(stages = [sqlTrans, assembler, labelIndexer, featureIndexer, rf])
ParamGrid for cross-validation
rfparamGrid = (ParamGridBuilder()
.addGrid(rf.maxDepth, [2, 5, 10, 20, 30])
.addGrid(rf.maxBins, [10, 20, 40, 80, 100])
.build())
CrossValidator function to run 5-fold CV with the grid set up in the previous step
crossvalR = CrossValidator(estimator = pipelineR,
estimatorParamMaps = rfparamGrid,
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy"),
numFolds=5)
cvModelR = crossvalR.fit(trainR)
avgMetrics attribute and the rfparamGrid object
list(zip(cvModelR.avgMetrics, rfparamGrid))
[(0.7539470627387321,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 2,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 10}),
(0.7528455637648926,
{Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 2,
Param(parent='DecisionTreeClassifier_2c80c59425f9', name='maxBins', doc='Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.'): 20}),
(0.7528455637648926,
 {'maxDepth': 2, 'maxBins': 40}),
 (0.7528455637648926, {'maxDepth': 2, 'maxBins': 80}),
 (0.7528455637648926, {'maxDepth': 2, 'maxBins': 100}),
 (0.7879359030737831, {'maxDepth': 5, 'maxBins': 10}),
 (0.7897355088613043, {'maxDepth': 5, 'maxBins': 20}),
 (0.796254373763148, {'maxDepth': 5, 'maxBins': 40}),
 (0.7918391640713425, {'maxDepth': 5, 'maxBins': 80}),
 (0.7943925633906993, {'maxDepth': 5, 'maxBins': 100}),
 (0.8614059172789005, {'maxDepth': 10, 'maxBins': 10}),
 (0.8678383868541354, {'maxDepth': 10, 'maxBins': 20}),
 (0.8656653164257458, {'maxDepth': 10, 'maxBins': 40}),
 (0.8737913955608884, {'maxDepth': 10, 'maxBins': 80}),
 (0.8766356256453319, {'maxDepth': 10, 'maxBins': 100}),
 (0.892205544388356, {'maxDepth': 20, 'maxBins': 10}),
 (0.8910928904063826, {'maxDepth': 20, 'maxBins': 20}),
 (0.8894743855417507, {'maxDepth': 20, 'maxBins': 40}),
 (0.9000779309111251, {'maxDepth': 20, 'maxBins': 80}),
 (0.8898619855524142, {'maxDepth': 20, 'maxBins': 100}),
 (0.892205544388356, {'maxDepth': 30, 'maxBins': 10}),
 (0.8910928904063826, {'maxDepth': 30, 'maxBins': 20}),
 (0.8894743855417507, {'maxDepth': 30, 'maxBins': 40}),
 (0.9000779309111251, {'maxDepth': 30, 'maxBins': 80}),
 (0.8898619855524142, {'maxDepth': 30, 'maxBins': 100})]
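The listing above pairs each average cross-validation accuracy with the `maxDepth`/`maxBins` combination that produced it. To pick out the best setting programmatically rather than by eye, one can reduce each entry to its metric and the two tuned values. The sketch below uses plain tuples standing in for the `Param` maps, with a handful of values copied from the output above; with the real objects, `list(zip(cvModel.avgMetrics, cvModel.getEstimatorParamMaps()))` should yield equivalent pairs in recent PySpark versions.

```python
# Each entry mirrors one (avgMetric, {maxDepth, maxBins}) pair from the
# CrossValidator output above (subset of values copied from the printout).
results = [
    (0.7528455637648926, {"maxDepth": 2, "maxBins": 80}),
    (0.796254373763148, {"maxDepth": 5, "maxBins": 40}),
    (0.8766356256453319, {"maxDepth": 10, "maxBins": 100}),
    (0.9000779309111251, {"maxDepth": 20, "maxBins": 80}),
    (0.9000779309111251, {"maxDepth": 30, "maxBins": 80}),
]

# Best setting = entry with the highest average CV accuracy.
best_metric, best_params = max(results, key=lambda t: t[0])
print(best_metric, best_params)
```

Note that depth 20 and depth 30 tie at the same metric (the tree apparently stops growing before depth 20 is exhausted), so `max` simply returns the first of the two.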
### Convert the test Spark DataFrame to pandas
obdts = test.toPandas()
gender = pd.get_dummies(obdts.Gender, prefix="Gender")
family = pd.get_dummies(obdts.family_history_with_overweight, prefix="family_history_with_overweight")
favc = pd.get_dummies(obdts.FAVC, prefix="FAVC")
caec = pd.get_dummies(obdts.CAEC, prefix="CAEC")
smoke = pd.get_dummies(obdts.SMOKE, prefix="SMOKE")
scc = pd.get_dummies(obdts.SCC, prefix="SCC")
calc = pd.get_dummies(obdts.CALC, prefix="CALC")
mtrans = pd.get_dummies(obdts.MTRANS, prefix="MTRANS")
dfs = obdts.drop(["Gender", "family_history_with_overweight","FAVC", "CAEC", "SMOKE", "SCC","CALC", "MTRANS"], axis = 1)
dfs = dfs.join(gender).join(family).join(favc).join(caec).join(smoke).join(scc).join(calc).join(mtrans)
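The eight separate `get_dummies`/`join` steps above can be collapsed into a single call by passing `columns=` to `pd.get_dummies`, which drops the original categorical columns and appends the prefixed indicator columns in one pass. A minimal sketch on a two-row stand-in frame (the sample values are assumed, not taken from the dataset):

```python
import pandas as pd

# Tiny stand-in for obdts with two of the categorical columns.
obdts_demo = pd.DataFrame({
    "Gender": ["Female", "Male"],
    "FAVC": ["yes", "no"],
    "Age": [21.0, 23.0],
})

# One call replaces the per-column get_dummies + drop + join chain;
# column names default to "<original>_<level>", matching the output above.
dfs_demo = pd.get_dummies(obdts_demo, columns=["Gender", "FAVC"])
print(sorted(dfs_demo.columns))
```

The column order differs slightly from the hand-built join chain (dummies are appended per source column), but the resulting names and values are the same.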
### Convert NObesity to a binary variable using the function `obs` defined in the modeling step
obsYes = dfs['NObesity'].apply(obs)
dfs = dfs.drop(["NObesity"], axis = 1)
dfs = dfs.join(obsYes)
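For readers landing here without the modeling step, the recoding function `obs` applied above might look like the following sketch; it maps the three obesity classes to 1 and everything else to 0, as described in the data-cleaning notes (the exact label spellings below are assumptions about the dataset's wording):

```python
def obs(level):
    """Map an NObesity label to 1 for the three obesity classes, else 0.

    Label strings are assumed; adjust to match the dataset's spelling.
    """
    obesity_levels = ("Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III")
    return 1 if level in obesity_levels else 0

print(obs("Obesity_Type_II"), obs("Normal_Weight"))
```

Because the `pd.Series` returned by `apply` keeps the name `NObesity`, the joined column below still appears as `NObesity` even though it now holds the binary `obsYes` values.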
dfs.columns
Index(['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE',
'Gender_Female', 'Gender_Male', 'family_history_with_overweight_no',
'family_history_with_overweight_yes', 'FAVC_no', 'FAVC_yes',
'CAEC_Always', 'CAEC_Frequently', 'CAEC_Sometimes', 'CAEC_no',
'SMOKE_no', 'SMOKE_yes', 'SCC_no', 'SCC_yes', 'CALC_Frequently',
'CALC_Sometimes', 'CALC_no', 'MTRANS_Automobile', 'MTRANS_Bike',
'MTRANS_Motorbike', 'MTRANS_Public_Transportation', 'MTRANS_Walking',
'NObesity'],
dtype='object')
dfs.head()
|   | Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE | Gender_Female | Gender_Male | ... | SCC_yes | CALC_Frequently | CALC_Sometimes | CALC_no | MTRANS_Automobile | MTRANS_Bike | MTRANS_Motorbike | MTRANS_Public_Transportation | MTRANS_Walking | NObesity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.0 | 1.75 | 57.0 | 3 | 3 | 2 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 19.0 | 1.63 | 58.0 | 3 | 3 | 2 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 19.0 | 1.63 | 76.0 | 3 | 3 | 3 | 2 | 1 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 19.0 | 1.64 | 53.0 | 3 | 3 | 1 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 19.0 | 1.65 | 61.0 | 3 | 1 | 3 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
5 rows × 31 columns
testSQL = spark.createDataFrame(dfs)
testSQL.show(5)
+----+------+------+----+---+----+---+---+-------------+-----------+---------------------------------+----------------------------------+-------+--------+-----------+---------------+--------------+-------+--------+---------+------+-------+---------------+--------------+-------+-----------------+-----------+----------------+----------------------------+--------------+--------+
| Age|Height|Weight|FCVC|NCP|CH2O|FAF|TUE|Gender_Female|Gender_Male|family_history_with_overweight_no|family_history_with_overweight_yes|FAVC_no|FAVC_yes|CAEC_Always|CAEC_Frequently|CAEC_Sometimes|CAEC_no|SMOKE_no|SMOKE_yes|SCC_no|SCC_yes|CALC_Frequently|CALC_Sometimes|CALC_no|MTRANS_Automobile|MTRANS_Bike|MTRANS_Motorbike|MTRANS_Public_Transportation|MTRANS_Walking|NObesity|
+----+------+------+----+---+----+---+---+-------------+-----------+---------------------------------+----------------------------------+-------+--------+-----------+---------------+--------------+-------+--------+---------+------+-------+---------------+--------------+-------+-----------------+-----------+----------------+----------------------------+--------------+--------+
|17.0| 1.75| 57.0| 3| 3| 2| 0| 1| 1| 0| 0| 1| 0| 1| 0| 1| 0| 0| 1| 0| 1| 0| 0| 0| 1| 0| 0| 0| 1| 0| 0|
|19.0| 1.63| 58.0| 3| 3| 2| 0| 0| 1| 0| 1| 0| 1| 0| 0| 0| 1| 0| 1| 0| 0| 1| 0| 0| 1| 0| 0| 0| 1| 0| 0|
|19.0| 1.63| 76.0| 3| 3| 3| 2| 1| 1| 0| 0| 1| 1| 0| 0| 1| 0| 0| 0| 1| 1| 0| 0| 1| 0| 1| 0| 0| 0| 0| 0|
|19.0| 1.64| 53.0| 3| 3| 1| 1| 1| 1| 0| 0| 1| 0| 1| 0| 0| 1| 0| 1| 0| 1| 0| 0| 0| 1| 0| 0| 0| 1| 0| 0|
|19.0| 1.65| 61.0| 3| 1| 3| 1| 0| 1| 0| 1| 0| 0| 1| 0| 0| 1| 0| 1| 0| 0| 1| 0| 1| 0| 0| 0| 0| 1| 0| 0|
+----+------+------+----+---+----+---+---+-------------+-----------+---------------------------------+----------------------------------+-------+--------+-----------+---------------+--------------+-------+--------+---------+------+-------+---------------+--------------+-------+-----------------+-----------+----------------+----------------------------+--------------+--------+
only showing top 5 rows
accuracy = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(cvModel.transform(testSQL))
print("Test Error = %g " % (1.0 - accuracy))
Test Error = 0.224057
accuracyN = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(cvModelN.transform(testSQL))
print("Test Error = %g " % (1.0 - accuracyN))
Test Error = 0.358491
accuracyC = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(cvModelC.transform(testSQL))
print("Test Error = %g " % (1.0 - accuracyC))
Test Error = 0.238208
accuracyI = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(cvModelI.transform(testSQL))
print("Test Error = %g " % (1.0 - accuracyI))
Test Error = 0.441038
accuracyP = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(cvModelP.transform(testSQL))
print("Test Error = %g " % (1.0 - accuracyP))
Test Error = 0.377358
accuracyT = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy").evaluate(cvModelT.transform(testSQL))
print("Test Error = %g " % (1.0 - accuracyT))
Test Error = 0.113208
accuracyR = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy").evaluate(cvModelR.transform(testSQL))
print("Test Error = %g " % (1.0 - accuracyR))
Test Error = 0.0943396
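Collecting the printed test errors side by side makes the model comparison explicit. The sketch below hard-codes the values from the outputs above, keyed by the `cvModel` suffixes used in this notebook (what each suffix denotes is not stated here; `cvModelR` is presumably the random-forest pipeline if the naming follows the obvious convention), and picks the best-performing model:

```python
# Test errors printed above, keyed by the cvModel variable suffix.
test_error = {
    "cvModel":  0.224057,
    "cvModelN": 0.358491,
    "cvModelC": 0.238208,
    "cvModelI": 0.441038,
    "cvModelP": 0.377358,
    "cvModelT": 0.113208,
    "cvModelR": 0.0943396,
}

# Lowest test error wins.
best = min(test_error, key=test_error.get)
print(best, test_error[best])
```

By this tally `cvModelR` is the strongest model on the held-out test set, with roughly 9.4% test error, followed by `cvModelT` at about 11.3%.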